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1. Introduction 

Model selection is usually understood as selection of continuous explanatory variables. However, 
when a categorical predictor is considered, in order to reduce model’s complexity, we can either 
exclude the whole factor or merge its levels. 

A traditional method of examining the relationship between a continuous response and categor¬ 
ical variables is analysis of variance (ANOVA). After detecting the overall importance of a factor, 
pairwise comparisons of group means are used to test significance of differences between its levels. 
Typically post-hoc analysis such as Tukey’s honestly significant difference (HSD) test or multi¬ 
ple comparison adjustments (Bonferroni, Scheffe) are used. A drawback of pairwise comparisons is 
non-transitivity of conclusions. 

For example, let us consider data barley from R library lattice discussed already in Bondell 
and Reich (2009). Total yield of barley for 5 varieties at 6 sites in each of two years is modeled. The 
dependence between the response and the varieties variable with the use of Tukey’s HSD analysis 
(Figure 1) gives inconclusive answers: f3p = Pm , Pp = (3t, but f3p Pm- 

In this work we introduce a novel procedure called delete or merge regressors (DMR), which 
enables efficient search among partitions of factor levels, for which the issue of non-transitivity 
does not occur. If we apply DMR to the barley data, we get the following partition of varieties: 
{{5, M , V, P}, {T}}. Detailed description of the data set and the characteristics of the chosen model 
can be found in Section 5.5. 

The idea of partitioning a set of levels of a factor into non-overlapping groups has already been 
discussed in the literature. In the article Tukey (1949) a stepwise backward procedure based on 
the studentized range which gives grouping of means for samples from normal distributions was 
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95% family-wise confidence level 



Differences in mean levels of variety 


Figure 1: Results of Tukey’s HSD. 


proposed. Other methods of clustering of sample means were described in Scott and Knott (1974), 
where the set of means is partitioned from coarsest to finest, and in Calinski and Corsten (1985) 
whose algorithm adapts hierarchical clustering to the problem. In more recent articles Porreca 
and Ferrari-Trecate (2010) and Ciampi et al. (2008) efficient algorithms for datasets partitioning 
using generalized likelihood ratio test can be found. However, all the mentioned methods assume an 
arbitrary choice of significance level for the underlying tests. In our procedure we avoid the problem 
by selecting the final partition according to minimal value of information criterion. 

Information criterion as an objective function for partition selection is used in the procedures 
described in Dayton (2003). Dayton’s SAS procedure, called paired comparisons information cri¬ 
teria (PCIC), computes AIC and BIC values for all ordered subsets of independent means for 
both homogeneous and heterogeneous models. In contrast to DMR these methods do not allow for 
simultaneous factor partitioning and selection of continuous variables. 

A method introduced in Bondell and Reich (2009) called collapsing and shrinkage ANOVA (CAS¬ 
ANOVA) solves the same problem as DMR with use of the least absolute shrinkage and selection 
operator (Lasso; Tibshirani (1996)), where the L\ penalty is imposed on differences between param¬ 
eters corresponding to levels of each factor. This algorithm can be interpreted as a generalization 
of fused Lasso (Tibshirani et al. (2004)) to data with categorical variables. In Gertheiss and Tutz 
(2010) one can find a modification of CAS-ANOVA, which is more computationally efficient because 
of using the least angle regression algorithm (LARS; Efron et al. (2004)). Another algorithm, based 
on regularized model selection with categorical predictors and effect modifiers (Oelker, Gertheiss 
and Tutz (2012)) is implemented in R package gvcm.cat. It generalizes Lasso approach to simul¬ 
taneous factor partitioning and selection of continuous variables to generalized linear models. The 
algorithm is based on local quadratic approximation and iterated reweighted least squares. 


imsart-ejs ver. 2014/02/20 file: DMR.tex date: May 18, 2015 















A. Maj-Kanska et al./DMR for Linear Model Selection 


4 


We propose a backward selection procedure called delete or merge regressors (DMR), which 
combines deleting continuous variables with merging levels of factors. The method employs a greedy 
search among linear models with a set of constraints of two types: either a parameter for a continuous 
variable is set to zero or parameters corresponding to two levels of a factor are set to equal each 
other. In each step the choice of constraint is based on the order of squared t-statistics. As a result a 
nested family of linear models is obtained and the final decision is made by minimization of Bayesian 
information criterion (BIC). The method adapts agglomerative clustering, where squared t-statistics 
define the dissimilarity measure. This procedure generalizes concepts introduced in Zheng and Loh 
(1995) and Ciampi et al. (2008) . 

In the article we show that DMR algorithm is a consistent model selection method under rather 
weak assumptions when p tends to infinity with n. Furthermore, thanks to using a recursive formula 
for RSS in a nested family of linear models, the time complexity of DMR algorithm is just 0(np 2 ). 
This makes the algorithm much faster than the competitive Lasso-based methods. In the article we 
describe a simulation study and discuss a pertaining R package. The simulations show that DMR 
in comparison to adaptive Lasso methods described in the literature gives better results in terms 
of accuracy without the troublesome choice of the A grid. 

The remainder of the article proceeds as follows. The class of feasible models considered when 
performing model selection is defined in Section 2. DMR procedure is introduced in Section 3, while 
its asymptotic properties are discussed in Section 4. Simulations and real data examples are given 
in Section 5 to illustrate the method. All proofs are given in the Appendix. 

2. Feasible models 

In this section we first introduce some definitions regarding the form of the data and models 
considered. In particular, we define the set of feasible models, which are linear spaces of parameters 
with linear constraints and we show how by change of variables the constrained problem can be 
replaced by unconstrained one. Later we indicate that properties of OLS (ordinary least squares) 
estimators transfer to feasible models. 

2.1. Definitions 

Let us consider data generated by a full rank linear model with n observations and p < n parameters: 

y = X/3* + e = 1/4 + X o (3* 0 + Xift + ... + X ; /3;* + e, (1) 


where: 

1. e is a vector of iid zero-mean gaussian errors, e ~ J\f( 0, <r 2 I). 

2. X = [l,X 0l X 1 ,...,X i ] is a model matrix organized as follows: Xo is a matrix corresponding 
to continuous regressors and Xi,..., X; are zero-one matrices encoding corresponding factors 
with the first level set as the reference. 

3. (3* = [/4,/3S T ,/3f,...,/3r T ] T G R p is a parameter vector organized as follows: /4 ' s the 
intercept, (3* 0 = [/4,..., /3* o0 ] T is a vector of coefficients for continuous variables and /3* k = 
[,4,..., Pp k k\ T is a vector of parameters corresponding to the fc-th factor, k = 1,..., /, hence 
the length of the parameter vector is p = 1 + p 0 + (pi — 1) + ... + (pi — 1). 
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Denote sets of indexes: N = {0,1,, l}, N 0 = {0,1,... ,po} and N k = {2,3,... ,p k } for k £ 
N\ {0}. Let us define an elementary constraint for linear model (1) as a linear constraint of one of 
two types: 

H jk : 0j k = 0 where j £ N k \ {0}, k £ N, (2) 

Kijk ■ P*k = Pjk where i,j £ N k , i^j, k £ N\ {0}. (3) 

A feasible model can be defined as a sequence M = ( P 0 , Pi ,..., Pi), where P 0 denotes a subset of 
indexes of continuous variables and P k is a particular partition of levels of the k- th factor. Such a 
model can be encoded by a set of elementary constraints. A set of all feasible models is denoted by 
M. Let us denote model F £ A4 without constraints of types (2) or (3) as the full model. 
Example 1. For illustration, let us consider a model with one factor and one continuous variable: 


y = X(3* + e = 1 • 1 + X 0 • 2 + X x • 
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where Xo and e are vectors of length 8 generated independently from standard normal distribution, 
7V(0, 1 ). Then /3* = [1, 2 , - 2 , - 2 , 0 ] T . The full model F = (P 0 = {1}, Pi = {{1}, { 2 }, { 3 }, {4}}) with 
Po = l,pi = 4,p = 5. The model corresponding to /3* is (Pq = { 1 }, Pi — {{ 1 , 4}, { 2 , 3}}) and is the 
same as F with two elementary constraints: = 0 and P 21 = Pii- 


2.2. Unconstrained parametrization of feasible models 


A feasible model can be defined by a linear space of parameters 

Cm = {/3 £ : A 0 m/3 = 0} , 


( 5 ) 


where Ao m is a (p — q) x p matrix encoding q elementary constraints induced by the model. Such a 
constraint matrix can be expressed in many ways. In particular, every linear space can be spanned 
by different vectors. The number of such vectors can be greater than the dimension of the space 
when they are linearly dependent. In order to unify the form of a constraint matrix, we introduce 
the notion of regular form, which is described in the Appendix A. We assume that Aqm is in regular 
form. Let Aim be a q x p complement of Aqm to invertible matrix Am, that is: 


A m = 


Aim 

Aqm 


Denote: 


A -1 - 

A M ~ 


A 1 

^-M 


A 0 

a m 


( 6 ) 
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where is a p x q matrix. In order to replace a constrained by an unconstrained parametrization 
change of variables in model M is performed. Let f3 M £ Cm and = Aim Pm- We have: 

Pm = AmIm- (?) 


Indeed, 


Pm — A-^AmPm — ^-m 


A\m Pm 

r a 1 

A° 1 

4m 

AqmPm 

L 

J 

0 




From equation (7) we obtain X/3 M = Zim4m> where Z 1M = XAj^ and Cm = {Aj^ : £ £ R 9 }. 
Let us notice that Cm is a linear space spanned by columns of A X M . The dimension of space Cm 
will be called the size of model M and denoted by \M\. Note that \M\ = q. 

Example 1 continued. Matrices Am,Zim and £ M are: 
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1 -0.96 0 
1 -0.29 0 
1 0.26 1 
1 -1.15 1 
1 0.2 1 
1 0.03 1 

1 0.09 0 

1 1.12 0 


4m = (£i, 6,6) T , 4i = Poo, 6 = Plo , 6 = /?2i = Mal¬ 


one can see that a change from constrained to unconstrained problem was done by adding and 
deleting columns of the model matrix. 

The OLS estimator of (3 * constrained to Cm is given by the following expression: 


Pm ~ where — (^im^im) ZTmY- 


( 8 ) 


Note that A omPm = ^omA^ m = 0 and thus indeed (3 M £ Cm- We define the inclusion relation 
between two models Mi and M 2 by inclusion of linear spaces 


Mi C M 2 denotes Cm x Q £m 2 


(9) 


and intersection of two models Mi and M 2 by intersection of linear spaces: 

Mi n M 2 as a model defined by Cm 1 DCm 2 - (10) 

A feasible model M will be called a true model if (3* £ Cm- A true model with minimal size will 
be denoted by T. Observe that T is unique because X is a full rank matrix. 

Example 1 continued. For the illustrative example the true model T is T = ({1}, {{1,4}, {2, 3}}). 
The dimensions of the considered models are |F| = p = 5, \T\ = 3. 
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2.3. Residual sum of squares and generalized information criterion for feasible 
models 

Let H m = Zi m (Z^a/Zim) 1 Z J M . Observe that HmX/ 3* = X/3* for M D T . We define residual 
sum of squares for model M as RSSm = ||y — X/3 M || 2 . From equation (8) we have: 

RSS m = ||y - Z 1M ? M II 2 = 11(1- H M )y|| 2 . 

Let us denote: 

A m = /3* T X T (I - H m )X/3* = 11X/3* - X/3^|| 2 , (11) 

where f3* M = argmin^^ w ||X/3* — X/3|| 2 . Notice that (3 M ——->■ (3* M with n —> oo. The following 
decomposition of RSS in linear models is trivial, hence we omit the proof: 

Proposition 1. 

RSSm = A m + 2/3* T X T (I - H M )e + e T (I - H M )e. 

In particular for M D T 

RSS m = e T (I - H M )e ~ cr 2 X 2 _| M |- 

Therefore, the predictions for constrained problem can be obtained through projecting the ob¬ 
servations on the space spanned by columns of the model matrix for the equivalent unconstrained 
problem. Hence, decompositions and asymptotic properties of residual sums of squares for feasible 
models are inherited from unconstrained linear models. 

Bayes Information Criterion for model M is defined as: 

BIC m = n log RSS m + log(n) \M\. 

The goal of our method is to find the best feasible model according to BIC, taking into account 
that the number of feasible models grows exponentially with p. Since for the fc-th factor number of 
possible partitions is the Bell number B(pk), the number of all feasible models is 2 Po ni=i B(Pk)- 
In order to significantly reduce the amount of computations, we propose a greedy backward search. 

3. DMR algorithm 

In this section we introduce DMR algorithm. Because of troublesome notations, in order to make 
the description of the algorithm more intuitive, we present here a general idea of the algorithm. In 
particular, we give the details of step 3 of the algorithm in the Appendix B. 

Assuming that X is of full rank the QR decomposition of the model matrix is X = QR, where 
Q is n x p orthogonal matrix and R is p x p upper triangular matrix. Denote minimum variance 
unbiased estimators of (3 and a 2 for the full model F as: 

(3 = R _1 z and a 2 = ^ ^ , where z = Q T y. (12) 

n — p 

Let us denote 

(3 = [/3jfc]j E N k , R = [rjk.st] 3 € N k , 

k e n s e N t 

k,t e N 
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then 


and Vjk is a row of R 1 


Pjk = rj fc z, where j € N k ,k£ N 


Algorithm 1 DMR (Delete or Merge Regressors) 

Input: V,x 

1. Computation of t-statistics 

Compute the QR decomposition of the full model matrix, obtaining matrix R —1 , vector z and variance estimator 
a 2 as in equation (12). Calculate squared t-statistics: 

1. for all elementary constraints defined in (2): 

,2 ( r ^) 2 

ljk Var@ jk ) 5 2 ll r jfcll 2 

2. for all elementary constraints defined in (3): 

,2 _ (flik ~ /% fc ) 2 _ 

J Var(fiik — (3jk) <j2 |l r ifc — r jfcll 2 

for ijj G Nk, i ^ j , k G N \ {0}. 

2. Agglomerative clustering for factors (using complete linkage clustering) 

For each factor perform agglomerative clustering using D*. = . as dissimilarity matrix for k G N \ {0}: 

1. d ljk = d jlk = t\ jk for j e N k , 

2. dij k = tf jk for i,j S N k , i ^ j, 

3. d iik = 0 for i S N k . 

We denote cutting heights obtained from the clusterings as h| , h f,..., h) . 

3. Sorting constraints (hypotheses) according to the squared t-statistics 

Combine vectors of cutting heights: h = [0, hj, h^,..., where ho is vector of squared t-statistics for con¬ 

straints concerning continuous variables and 0 corresponds to the full model. Sort elements of h in increasing order 
and construct a corresponding (p — 1) x p matrix Ao of consecutive constraints. 

4. Computation of RSS using a recursive formula in a nested family of models 

Perform QR decomposition of the matrix R -T Aj obtaining the orthogonal matrix W = [wi,... , w p _i]. Set 
RSSm 0 — ||y|| 2 — ll z l| 2 f° r a model without constraints. For m — 1,... ,p — 1 

RSS Mm = RSS Mm _! + (W^z) 2 , 

where M m denotes a model with constraints defined by m first rows of Ao- The last formula is derived in the 
Appendix C, see equation (22). 

5. Choosing the best model according to BIC 

Calculate 

BICm™ = ™logRSSM m + {p — m) log(n) 

for m = 0,... ,p — 1. Selected model T is the model minimizing BIC among models on the nested path: 

T = argmin BICM m - 

M m 

0<m<p-l 


for j e JVfc \ {0}, k e N, 

(Rife - r j k ) T z ) 2 


Output: T 


The time complexities of successive steps of DMR algorithm are 0(np 2 ) for QR decomposition 
in step 1, 0(p 2 ) for hierarchical clustering in step 2, 0(p 3 ) for QR decomposition used in step 4. 
The dominating operation in the described procedure is the QR decomposition of the full model 
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matrix. Hence, the overall time complexity of DMR algorithm is 0(np 2 ). 
Example 1 continued. For the illustrative example we have: 


0 
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0.15 

3.09 
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I 131 

^231 

0 
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e 341 


4.52 

0.15 
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2.91 

t 2 

''141 

^241 

t 2 

''341 
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0.20 

3.09 

2.91 

0 


h = [0,0.15,0.20,8.01, 9.33] t , A, 


A)0 $10 $21 $31 $41 

- 0 0-11 0 - 
0 0 0 0 1 

0 0 10 0 
. 0 1 0 0 0 . 


BIC = [28.33,26.65,25.36,34.68,39.59] T . 


Observe that the selected model T is the true model T . The dendrogram and cutting heights for 
the illustrative example obtained from clustering in step 2 are shown in Figure 2. The horizontal 
dashed line corresponds to the optimal partition chosen by BIC. 


Cluster Dendrogram 


as.dist(t(x)) 
hclust (*, "complete") 


Figure 2: Dendrogram for Example 1. 
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4. Asymptotic properties of DMR algorithm 

In Algorithm 1 and all the simulations and examples we assumed complete linkage in hierarchical 
clustering and BIC for selection in the nested family of models. The proof of consistency is more 
general: the linkage criterion has to be a convex combination of the minimum and maximum of the 
pairwise distances between clusters (see equation 24 in Appendix D) and generalized information 
criterion is used for final model selection: 

GIC'm = n log RSS M +r n \M\, 

where r n is the penalty for model size. Note that well known criteria AIC and BIC are special cases 
of GIC, if r n = 2 and r n = log(n) respectively. 

In this section we use /„ -< g n to denote f n = o(g n ). We allow the number of predictors p n to 
grow monotonically with the number of observations n under the condition p n -< n. 

We distinguish the following subsets of the set of all feasible models M: 

1. Uniquely defined model T, which is fixed and does not depend on sample size. We assume that 
the model consists of a finite number of continuous variables and a finite number of factors 
with finite numbers of levels. 

2. A set A4y of models with one constraint imposed which is false: 

M v = {M C F : \M\ = \F\ - 1 and T £ M}, 

3. A set Mr of models with one constraint imposed which is true: 

M r = {M CF : \M\ = |F| - 1 and T C M}. 

We denote: 

A = min Am, (13) 

M£LM v 

where A m was defined in equation (11). Let us notice that from equation (8) we get 
Var m'J = A^Var (^m) = ^1/ (-^-m X t XA]^) 

Then 

Var (Vi (3 m - /3*)) = nA l M (A^X t XA j*) -1 Ajj. 

Additionally, for finite p, independent of n, if Lx T X —> 2 > 0 then 

Var (Vi (/^m ~ @ )) ^ y = ■ 

Theorem 1. Assume that X is of full rank and p n -< r n -< min(n, A). Let T be the model selected 
by DMR, where linkage criterion for hierarchical clustering is a convex combination of minimum 
and maximum of the pairwise distances between clusters. Then 

(a) lim^oo P(T = T) = 1, 

(b) yjn (/ 3^ — (3*^j —^4 A/”(0, cr 2 ’Sr) if additionally p is finite, independent of n and ^X T X —> 
S > 0. 

Proof can be found in the Appendix D. 
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5. Numerical experiments 

All experiments were performed using functions implemented in R package called DMR, which is 
available at the CRAN repository. The main function in the package is called DMR and implements the 
DMR algorithm with an optional method of hierarchical clustering (default is complete) and a value 
of r n in GIC (default is log(n)). The package also contains other functions that are modifications of 
the DMR algorithm, such as stepDMR which assumes recalculation of t-statistics after accepting 
every new elementary constraint and DMR4glm which can be used for model selection in generalized 
linear models. 

We compared 2 groups of algorithms. The first one contains 3 stepwise procedures stepBIC, ffs 
BIC and DMR. The second group are 2 Lasso-based methods: CAS-ANOVA and gvcm. Procedure 
stepBIC is implemented in the function stepAIC in R package MASS and does not perform factor par¬ 
titions but either deletes or keeps any of categorical predictors. A factor forward stepwise procedure 
(ffs BIC), implemented in R package gvcm. cat is similar to DMR but differs in the search direction 
(DMR is backward and ffs BIC is forward) and in the criterion of selection of the best step (DMR 
uses t-statistics calculated only once and hierarchical clustering and ffs BIC recalculates criterion in 
every step). For DMR the complete linkage method of clustering and BIC were used. Algorithm gvcm 
is implemented in R package gvcm. cat where by default there are no adaptive weights and crossval¬ 
idation is used for choosing the A parameter. We used adaptive weights and BIC criterion for choos¬ 
ing the tuning parameter since we got better results then. Implementation of CAS-ANOVA can be 
found on the website http://www4.stat.ncsu.edU/~bondell/Software/CasANQVA/CasANQVA.R. 
Here the default BIC was used for choosing the A parameter making all the methods dependent on 
the same criterion of choosing the tuning parameters. Adaptive weights are also default in CAS¬ 
ANOVA. When using the two Lasso-based algorithms we found difficult the selection of the A 
grid. In all the experiments we tried different grids: the default ones and ours both on linear and 
logarithmic scales presenting only the best results. 

We describe three simulation experiments. In Section 5.2 results regarding an experiment con¬ 
structed in the same way as in Bondell and Reich (2009) is presented. The model consists of three 
factors and no continuous variables. As a continuation, simulations based on data containing one 
factor and eight correlated continuous predictors were carried out, the results can be found in Sec¬ 
tion 5.3. In Section 5.4 we summarize the results of an experiment regarding generalized linear 
models. In this experiment only 4 algorithms were compared since CAS-ANOVA applies only to 
normal distribution. 

In Section 5.1 we introduce measures of performance which are generalizations of popular true 
positive rate and false discovery rate on categorical predictors. We call them TPR* and FDR*. 
In comparison to generalizations introduced in Gertheiss and Tutz (2010) and Bondell and Reich 
(2009), which we call TPR and FDR , our measures don’t diminish the influence of continuous 
predictors and factors with a small number of levels. Hence, for evaluation of the model selection 
methods we used following criteria: true model (TM) represents the percentage of times the pro¬ 
cedure chose the entirely correct model. Correct factors (CF) represents the percentage of times 
the non-significant factors were eliminated and the true factor was kept. 1—TPR, FDR, 1—TPR* 
and FDR* are averaged errors made by selectors described in Section 5.1. MSEP stands for mean 
squared error of prediction for new data and MD is mean dimension of the selected model, both 
with standard deviations. 

The last Section 5.5 refers to two real data examples where barley yield and prices of apartments 
in Munich were modeled. 
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5.1. Measures of performance 

When performing simulations, results are usually compared to the underlying truth. Traditionally, 
for model selection with only continuous predictors measures such as true positive rate (TPR) or 
false discovery rate (FDR) are used. In the literature (Gertheiss and Tutz (2010), Bondell and Reich 
(2009)) their generalization to both continuous and categorical predictors can be found. 

Let us consider sets of elementary constraints corresponding to the true and selected models 
determined by sets of indexes: 

B = {(i,j,k): i^j,i,j£N k ,k£N\{0}, (f3*)ik = {P*)jk} 
u{(j, k) : j £ N k , k£ N, (/ 3*) jk = 0} 

and 

B = {(i,j,k): j,i,j £ N k ,k £ N\{ 0}, 0f)ik = 0f)jk} 

U{(j, k) : j £ N k ,k £ N, 0f) jk = 0}. 

True positive rate is the proportion of true differences which were correctly identified to all true 
differences, meaning ratio of the number of true elementary constraints which were found by the 
selector to the number of all true elementary constraints TPR = \B C\B\/\B\. False discovery rate 
is the proportion of false differences which were classified as true to all differences classified as true, 
meaning ratio of the number of false elementary constraints which were accepted by the selector to 
the number of all accepted elementary constraints FDR = 1 — \B B\/\B\. 

However, measures defined in this way diminish the influence of the continuous variables and 
factors with a small number of levels. As an example, consider a model with 5 continuous predictors 
and one factor with 5 levels. Then the number of parameters for continuous predictors is 5 and the 
number of possible elementary constraints equals 5. The number of parameters for the categorical 
variable is also 5, whereas the number of possible elementary constraints is (®) = 10. 

We introduce a different generalization of traditional performance measures using dimensions of 
linear spaces which define the true and selected models. We consider two models: true model T and 
selected model T. 

We define true positive rate coefficient as TPR* = |TDT|/|T| and false discovery rate coefficient 
as FDR* = 1 — \T C\T\/\T\, where TOT is defined according to equation (10). This generalization 
is more fair since the influence of every parameter on the coefficients is equal. In the article the 
attention is focused on values: 1 — TPR* and FDR*, which correspond to the errors made by 
selector. 

5.2. Experiment 1 

The layout of this experiment is the same as in Bondell and Reich (2009). Despite using different A 
grids, we weren’t able to obtain as good results for CAS-ANOVA as in the original paper. However, 
the results for DMR are much better in terms of TM than those for CAS-ANOVA originally reported 
in Bondell and Reich (2009). The experimental model consists of three factors having eight, four 
and three levels, respectively. The true model is T = (Pi, P 2 , P3), where 

Pi = ({1,2}, {3,4,5,6}, {7,8}), P 2 = {1,2,3,4} , P 3 = {1,2,3}. 
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The response y was generated using the true model: 

y = H + £, £ ~ A/"(0,E), 


where 


ft =l„/? 0 * 0 + Xift + X 2 /3* 2 + X 3 (3* 3 
=ln • 2 + X^O, -3, -3, -3, -3, —2, -2) T + X 2 (0, 0,0) T + X 3 (0, 0) T . 

A balanced design was used with c observations for each combination of factor levels, which gives 
n = 96 • c, c = 1,2,4. 

The data was generated 1000 times. The best results for Acas-Anova = (0.1, 0.2,..., 3) T and 
Agvcm = (0.01, 0.02,..., 3 ) t together with outcomes from other methods are summarized in Table 1. 
The results of Experiment 1 indicate that DMR and ffs BIC algorithms performed almost twice 
better than CAS-ANOVA and gvcm in terms of choosing the true model. Our procedure and ffs 
BIC chose approximately smaller models with dimension closer to the dimension of the underlying 
true model, whose number of parameters is three. There were no significant differences between 
mean squared errors of prediction for all considered algorithms. The main conclusion, that DMR 
and ffs BIC procedures choose models which are smaller and closer to the proper one, is supported 
by the obtained values of 1 - TPR* and FDR*. 


Table 1 

Results of the simulation study, Experiment 1 . 


n 

Algorithm 

TM(%) 

CF(%) 

1-TPR 

FDR 

1-TPR* 

FDR* 

MSEPisd 

MDisd 

96 

DMR 

44 

73 

0.05 

0.09 

0.1 

0.19 

1.091T.179 

3.4±.7 


ffs BIC 

42 

73 

0.04 

0.09 

0.1 

0.2 

1.091±.179 

3.5T.7 


CAS-ANOVA 

17 

83 

0.04 

0.14 

0.06 

0.33 

1.104±.175 

5.5± 1.7 


gvcm 

11 

49 

0.08 

0.15 

0.1 

0.34 

1.118T.179 

4.5T1.6 


stepBIC 

0 

97 

0 

0.29 

0 

0.63 

1.089T.171 

8.1T.4 

192 

DMR 

66 

82 

0.01 

0.05 

0.02 

0.1 

1.036T.11 

3.3T.6 


ffs BIC 

67 

83 

0.01 

0.05 

0.02 

0.1 

1.035T.11 

3.3T.5 


CAS-ANOVA 

33 

93 

0 

0.09 

0.01 

0.24 

1.049T.109 

4.9T1.3 


gvcm 

27 

60 

0.01 

0.11 

0.02 

0.27 

1.049T.11 

4.3T1.2 


stepBIC 

0 

99 

0 

0.29 

0 

0.63 

1.046T.109 

8±.2 

384 

DMR 

80 

89 

0 

0.03 

0 

0.05 

1.013T.074 

3.2±.4 


ffs BIC 

79 

89 

0 

0.03 

0 

0.05 

1.013T.074 

3.2T.4 


CAS-ANOVA 

50 

97 

0 

0.06 

0 

0.17 

1.022±.074 

4.2T1.2 


gvcm 

49 

77 

0 

0.06 

0 

0.16 

1.02T.074 

3.8±1 


stepBIC 

0 

100 

0 

0.29 

0 

0.63 

1.022±.074 

8±.l 


An exemplary run of DMR algorithm is shown in Figure 3. The horizontal dotted line indicates 
the cutting height for the best model chosen by BIC. 

In Table 2 the computation times of the algorithms are summarized. All values are divided by the 
computation time of lm. f it function, which fits the linear model with the use of QR decomposition 
of the model matrix. 

The results for CAS-ANOVA and gvcm are given for only one value of A. By default, the searched 
lambda grid is of length 50 and 5001, respectively. One can see that DMR is significantly faster 
than ffs BIC, CAS-ANOVA and gvcm. 
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Figure 3: An examplary run of DMR algorithm for Experiment 1. 


Table 2 

Computation times divided by the computation time of Im.fit, results obtained using system, time function. 


c 

n 

DMR 

ffs BIC 

CASANOVA 

gvcm 

stepBIC 

1 

96 

87 

883 

234 

250 

71 

4 

384 

36 

526 

89 

245 

31 

20 

1920 

19 

394 

21 

739 

16 


5.3. Experiment 2 

In the second experiment a model containing not only categorical predictors, but also continuous 
variables is considered. The response y was generated from the model with one factor with eight 
levels and eight continuous variables: 

y =V 0 a 0 + Viai + e 

=v„(l, o, 1,0,1,0,1, Of + VfO, 0, —2, -2, -2, -2,4,4f + e, 

where Vo was generated from the multivariate normal distribution with autoregressive correlation 
structure with p = 0.8. The first 2-16-c rows were generated using mean vector (1,1, 0,0,0, 0, 0, Of, 
then 4 • 16 ■ c observations using mean vector (0,0,1,1,1,1, 0, Of and the last 2 • 16 ■ c observations 
using mean vector (0,0, 0, 0,0, 0,1, if, according to the underlying true partition of the factor, 
c = 1, 2,4, hence n = 128 • c. Vi is a matrix of dummy variables encoding levels of the factor and 
£ was generated from zero-mean normal distribution, e ~ A/"(0,I). The data was generated 1000 
times. 

The best results for Acas-Anova = (0.1, 0.2,..., 3f and A gvcm = (0.01, 0.02,..., 5f together 
with outcomes from other methods are summarized in Table 3. Despite the fact that additional 
continuous variables were correlated, the obtained results show a considerable advantage of DMR 
algorithm over other methods. 
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Table 3 

Results of the simulation study, Experiment 2. 


n 

Algorithm 

TM(%) 

1-TPR 

FDR 

1-TPR* 

FDR* 

MSEPisd 

MDisd 

128 

DMR 

68 

0 

0.03 

0 

0.05 

1.076T.148 

7.4T.6 


ffs BIC 

60 

0.01 

0.04 

0.01 

0.06 

1.081T.15 

7.3±.8 


CAS-ANOVA 

17 

0 

0.13 

0 

0.21 

1.11±.153 

9.9T1.6 


gvcm 

12 

0.02 

0.11 

0.01 

0.23 

1.113±.154 

8.2T1.5 


stepBIC 

0 

0 

0.25 

0 

0.42 

1.101T.148 

12.1T.4 

256 

DMR 

78 

0 

0.02 

0 

0.03 

1.033±.093 

7.2±.5 


ffs BIC 

54 

0 

0.03 

0 

0.07 

1.034±.093 

7.4T.8 


CAS-ANOVA 

27 

0 

0.1 

0 

0.16 

1.049± .096 

9.2T1.4 


gvcm 

24 

0 

0.07 

0 

0.17 

1.047±.096 

7.5±1.3 


stepBIC 

0 

0 

0.25 

0 

0.42 

1.049±.095 

12.1±.3 

512 

DMR 

88 

0 

0.01 

0 

0.02 

1.015±.066 

7.1T.4 


ffs BIC 

85 

0 

0.01 

0 

0.02 

1.016±.066 

6.9±.6 


CAS-ANOVA 

46 

0 

0.06 

0 

0.1 

1.024±.067 

8.4T1.2 


gvcm 

35 

0 

0.05 

0 

0.12 

1.021±,067 

7±1.1 


stepBIC 

0 

0 

0.25 

0 

0.42 

1.023±.067 

12±.2 


5-4■ Experiment 3 

Simultaneous deleting continuous variables and merging levels of factors can also be considered in 
the framework of generalized linear models. The problem has already been discussed in Oelker, 
Gertheiss and Tutz (2012), where L\ regularization was used. After replacing squared t-statistics 
with squared Wald’s statistics, DMR algorithm can be easily modified to generalized linear models. 
Simulation results for DMR algorithm for logistic regression are presented below. Let us consider 
a logistic regression model whose linear part consists of three factors defined as in Experiment 1. 
The response y was independently sampled from binomial distribution: 

i = l,...,n, 

1, ft = (/ii,. .., ft n ) T and n = 96 • c for 

C = i, 2,4,8. 

The results of the experiment are summarized in Table 4. The best outcomes for gvcm, presented 
in the table, were obtained for A grids A g v C m = (0.01, 0.02,..., 5) T . Again, DMR and ffs BIC show 
considerable advantage over other model selection methods. 

In Table 5 the computation times of the algorithms are summarized. All values are divided by 
the computation time of glm.fit function. The results for gvcm are given for only one value of A, 
while by default the searched lambda grid is of length 5001. DMR is again significantly faster than 
ffs BIC and gvcm. 

5.5. Real data examples 

Example 1: Barley. The data set barley from R library lattice has already been discussed in 
the literature, for example in Bondell and Reich (2009). The response is the barley yield for each of 
5 varieties (Svansota, Manchuria, Velvet, Peatland and Trebi) at 6 experimental farms in Minnesota 
for each year of the years 1931 and 1932 giving a total of 60 observations. The characteristics of the 
chosen models using different algorithms are presented in Table 6. The results for the full model 


Vi 


B 1 


exp (p.i) 


1 + exp(^) / 

where fii are elements of fi defined as in Experiment 
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Table 4 

Results of the simulation study for logistic regression, Experiment 3. 


n 

Algorithm 

TM 

CF 

1-TPR 

FDR 

1-TPR* 

FDR* 

MSEPisd 

MDitsd 

96 

DMR 

6 

62 

0.21 

0.15 

0.38 

0.35 

0.304±.049 

3.1T1.2 


ffs BIC 

7 

72 

0.21 

0.14 

0.37 

0.35 

0.302±.049 

3.1T.8 


gvcm 

0 

21 

0.18 

0.32 

0.27 

0.61 

0.317T.062 

6.4±2.9 


stepBIC 

0 

96 

0.00 

0.29 

0.00 

0.63 

0.299T.049 

8±.6 

192 

DMR 

25 

81 

0.16 

0.09 

0.25 

0.23 

0.296T.036 

3±.7 


ffs BIC 

21 

82 

0.17 

0.10 

0.28 

0.26 

0.293T.034 

3±.7 


gvcm 

1 

26 

0.15 

0.26 

0.19 

0.52 

0.296T.038 

5.8T2.6 


stepBIC 

0 

99 

0.00 

0.29 

0.00 

0.63 

0.291±.034 

8±.2 

384 

DMR 

55 

88 

0.06 

0.06 

0.12 

0.14 

0.29±.023 

3.1±.5 


ffs BIC 

51 

88 

0.06 

0.06 

0.12 

0.16 

0.29T.023 

3.2T.5 


gvcm 

6 

37 

0.08 

0.20 

0.10 

0.43 

0.289T.022 

5.5T2.5 


stepBIC 

0 

100 

0.00 

0.29 

0.00 

0.63 

0.289T.022 

8±.2 

768 

DMR 

79 

92 

0.01 

0.03 

0.03 

0.07 

0.29T.016 

3.1±.4 


ffs BIC 

79 

92 

0.01 

0.03 

0.03 

0.06 

0.29T.016 

3.1±.4 


gvcm 

20 

48 

0.01 

0.16 

0.02 

0.36 

0.289T.016 

5.2±2.2 


stepBIC 

0 

100 

0.00 

0.29 

0.00 

0.63 

0.29=t.016 

8±.l 


Table 5 

Computation times divided by the computation time of glm. fit, results obtained using system, time function. 


c 

n 

DMR 

ffs BIC 

gvcm 

stepBIC 

1 

96 

103 

399 

101 

40 

4 

384 

68 

398 

74 

28 

20 

1920 

49 

377 

101 

23 


which is least squares estimator with all variables were given as a benchmark. For the two Lasso- 
based algorithms we find difficult the selection of the A grid. Therefore, the results for CAS-ANOVA 
are given for two different grids: the first one chosen so that the chosen model was the same as the 
one described in Bondell and Reich (2009), Ai = (25, 25.01, 25.02,..., 35) T , and the second wider 
superset of the first one, Ai = (0.1,0.2, 0.3,..., 35) T . We used A 2 grid also for gvcm. 

The results show that stepwise methods give smaller models with smaller BIC values than the 
Lasso-based methods. The additional advantage of DMR and ffs BIC is lack of a troublesome tuning 
parameter. 


Table 6 

Characteristics of the chosen models for Barley data set. 


algorithm 

model dim 

R z 

adj. K z 

BIC 

full model 

ii 

.68 

.61 

416 

stepBIC 

ii 

.68 

.61 

416 

CAS-ANOVA A 2 

9 

.66 

.61 

411 

gvcm A 2 

7 

.66 

.6 

403 

CAS-ANOVA Ai 

6 

.61 

.58 

407 

ffs BIC 

5 

.64 

.61 

399 

DMR 

5 

.64 

.61 

399 


Example 2: Miete. The data set miete03 comes from http : //www. statistik.lmu.de/service/ 
datenarchiv. The data consists of 2053 households interviewed for the Munich rent standard 2003. 
The response is monthly rent per square meter in Euros. 8 categorical and 3 continuous variables 
give 36 and 4 (including the intercept) parameters. The data is described in detail in Gertheiss and 
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Tutz (2010). 

Model selection was performed using five methods: DMR, ffs BIC, CAS-ANOVA, gvcm and 
stepBIC. Characteristics of the chosen models are shown in Table 7 with results for the full model 
added for comparison. 


Table 7 

Characteristics of the chosen models for Miete data set. 


Selection 

method 

Model 

dimension 

K-* 

adj.it/* 

BIC 

Full model 

40 

.94 

.94 

23037 

CAS-ANOVA 

31 

.94 

.94 

22972 

gvcm 

26 

.94 

.94 

22933 

DMR 

12 

.94 

.94 

22833 

stepBIC 

11 

.94 

.94 

22847 


The reason of lack of results for ffs BIC in the part of Table 7 is that the algorithm required to 
allocate too much memory (factor urban district has 25 levels). 

We can conclude that DMR procedure and ffs BIC chose much better models than other compared 
methods in terms of BIC. However, DMR method can be applied to problems with larger number 
of parameters. 

6. Discussion 

We propose the DMR method which combines deleting continuous variables and merging levels of 
factors in linear models. DMR relies on ordering of elementary constraints using squared t-statistics 
and choosing the best model according to BIC in the nested family of models. A slightly modified 
version of the DMR algorithm can be applied to generalized linear models. 

We proved that DMR is a consistent model selection method. The main advantage of our theorem 
over the analogous one for the Lasso based methods (CAS-ANOVA, gvcm) is that we allow that 
the number of predictors grows to infinity. 

We show in simulations that DMR and ffs BIC are more accurate than the Lasso-based methods. 
However, DMR is much faster and less memory demanding in comparison to ffs BIC. Our results 
are not exceptional in comparison to others in the literature. In Example 1 in Zou and Li (2008) 
a similar simulation setup to our Experiment 1, n = 96, has been considered. The adaptive Lasso 
method (denoted there as one-step LOG) was outperformed by exhaustive BIC with 66 to 73 
percent of true model selection accuracy. We repeated the simulations and got similar results with 
76 percent for the Zheng-Loh algorithm (described in Zheng and Loh (1995)), which is DMR with 
just continuous variables. Thus, in the Zou and Li experiment the advantage of the Zheng-Loh 
algorithm over the adaptive Lasso is not as large as in our work, but Zou and Li used a better local 
linear approximations (LLA) of the penalty function in the adaptive Lasso implementation. Recall 
that both CAS-ANOVA and gvcm employ the local quadratic approximation (LQA) of the penalty 
function. 

The superiority of DMR over the Lasso based methods in our experiments not only comes from 
weakness of LQA used in the adaptive Lasso implementation. Greedy subset selection methods 
similar to the Zheng-Loh algorithm have been proposed many times. Recently, in Pokarowski and 
Mielniczuk (2013) a combination of screening of predictors by the Lasso with the Zheng-Loh greedy 
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selection for high-dimensional linear models has been proposed. The authors showed both theoreti¬ 
cally and experimentally that such combination is competitive to the Multi-stage Convex Relaxation 
described in Zhang (2010), which is least squares with capped l\ penalty implemented via LLA. 


Appendix A: Regular form of constraint matrix 

We say that Ao m is in regular form if it can be complemented to A m so that: 

A m = 


Aim 


I 

0 ' 

A 0 m 



I 


where B m is a matrix consisting of 0, —1,1. Then, using Schur complement we get: 

= [ A-m 


A — 


I 

— B M 


A 0 


(14) 


(15) 


Constraint matrix in regular form can always be obtained by a proper permutation of model’s 
parameters. Let us denote clusters in each partition: ?Mk = (pMikfi— 1 > where is the number of 
clusters, k £ N \ {0} and minimal elements in each cluster as juik = min{j £ C'Mik}- Let Pmo 
denote the set of continuous variables in the model. Sort model’s parameters in the following order: 

1- Poo, 

2- Pjo- j G Pmo \ {0}, 

3- PiMikk for * = 1,..., 4, i ± 1, k £ IV \ {0}, 

4. Pjo- j € A 0 \ Pmo, 

5- Pjk, j £ C'Mik \ {jMifc}, k £ N \ {0}. 

Sort columns of model matrix X in the same way as vector /3. 

Example 1. As an illustrative example consider a full model F = (Pfo, Pfi, Pf 2 ), where 

P F0 = {1,2}, Pfi = ({1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}), P F2 = ({1}, {2}, {3}) 

and po = 2,pi = 8 ,p 2 = 3 ,p = 12. We denote a feasible model with 7 elementary constraints: 
Pw = 0, P 21 = 0, Pn = 0, P 31 = /?5i, pn = Pqi, Pn = Ps 1 , P 22 = 0 as M = (Pmo,Pmi,Pm 2 ), 
where : 

P M0 = {2}, P M 1 = ({1,2, 7} , {3,5} , {4,6,8}), P M 2 = ({1,2} , {3}). 

Constraint matrix in regular form for model M, where each row corresponds to one of the 7 ele¬ 
mentary constraints, is: 


Aq m — 


Poo 

0 

0 

0 

0 

0 

0 

0 


P20 

0 

0 

0 

0 

0 

0 

0 


^31 

0 

0 

0 

-1 

0 

0 

0 


pil 

0 

0 

0 

0 

-1 

-1 

0 


P32 

0 

0 

0 

0 

0 

0 

0 


P10 

1 

0 

0 

0 

0 

0 

0 


P21 

0 

1 

0 

0 

0 

0 

0 


P71 

0 

0 

1 

0 

0 

0 

0 


P51 

0 

0 

0 

1 

0 

0 

0 


^61 

0 

0 

0 

0 

1 

0 

0 


Psi 

0 

0 

0 

0 

0 

1 

0 


P22 

0 ' 
0 
0 
0 
0 
0 
1 
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and after inverting matrix A M is obtained 


A^ 1 - 


A 1 


A 0 

A M 


Poo 

P20 

Psi 

P41 

P32 

P10 

P21 

p71 

p51 

P61 

p81 

P22 

■ 1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 - 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

1 

0 

0 

0 

0 

0 

0 

1 

0 

. 0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 . 


Notice that for regular constraint matrix Z m is the full model matrix X with appropriate columns 
deleted or added to each other. 

Appendix B: Detailed description of step 3 of the DMR algorithm 

Since step 3 of DMR algorithm needs complicated notations concerning hierarchical clustering, we 
decided to present them in the Appendix for the interested reader. In particular, we show here how 
the cutting heights vector h and matrix of constraints A 0 are built. 

Let us define vectors a(l ,j,k) and a (corresponding to the elementary constraints, being 
building blocks for Ao) such that : 

a(l ,j,k) = [a at (j,k)]seN t , a st (j,k) = l(s = j,t = k), (16) 

t e n 

a (i,j,k) = [a st (i,j,k)\ s e « t , a st {i,j,k) = l(s = i, t = k) - l(s = j, t = k). (17) 

t e n 

For each step s of the hierarchical clustering algorithm we use the following notation for the 
partitions of set {1} U N k = {1,2,... ,p fc }: 

Psk = {Ci s k} P iti S+1 , s = l,...,p k . 

We assume complete linkage clustering: 

d (C , i s _|_ 1 ,s+l,fc — Ci s sk U Cj s skiP'j 8 +i,s-\-l y k — P^o s sk) 

— max {d ( Ci s sk > C 0 s sk) ? d (Cj sSk , Co s sk^) } • 

Cutting heights in steps s = 1, ... ,p k — 1 are defined as: 

ksk — m )n d (Ciski C jgh) • 

Let us denote vector a sk as an elementary constraint corresponding to cutting height h sk , where: 

a sfc = a(i*, j*,fc), ** = min *, j* = min j and = argmind {C isk , C jsk ). 

ieC zlBk jeC J1Bk 
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Step 3 of the algorithm can be now rewritten: 

Combine vectors of cutting heights: h = [0, hj, hf,..., hf] T , where ho is vector of cutting heights 
for constraints concerning continuous variables and 0 corresponds to model without constraints: 

hfc = [MSS k e N\ {0} and h 0 = [0, t\ w , t 2 120 ,..., t 2 lpo0 ] T . 

Sort elements of h in increasing order getting h ; = [h m:p ]^ n= l an d construct (p — 1) x p matrix of 
constraints 

Ao = [&2:p, &3:p, • ■ ■ * &p:p] * 

where a m:p is the elementary constraint corresponding to cutting height h m:p . Then proceed as 
described in Algorithm 1. 


Appendix C: Recursive formula for RSS in a nested family of linear models 

In this section we show some implementation facts concerning the DMR algorithm. In particular an 
effective way of calculation of residual sums of squares for nested models using QR decompositions 
is discussed. 

Let us consider a linear model with linear constraints: 

£ = {/3 € R p , A 0 /3 = 0} , (18) 

where A 0 is (p — q) x p constraint matrix. The objective is to calculate residual sum of squares 
RSS = ||y — X/3|| 2 . QR decomposition of the model matrix is performed 

X = QR, 

where Q is n x p orthogonal matrix and R is p x p upper triangular matrix. Let us denote S = 
R _i Aq , then 

Q T y = R (3* + Q t £ and S T R/3* = 0. 

After substitution z = Q 2 y, 7 * = R/3*, r\ = Q T e we get 

z = 7 * + rj and U T W T 7 * = 0. (19) 

where W and U are respectively p x (jp — q) orthogonal matrix and (p — q) x (p — q) upper triangular 
matrix from the QR decomposition of matrix S. We have 

W T 7* = UU T W T 7* = 0. 

Let us denote W as orthogonal complement of W to matrix with dimensions p x p. We multiply 
equation (19) by [W,W]: 

[W, W] T z = [W, W] T 7 * + [W, W] t T 7 and W T 7 * = 0. 

Therefore the OLS estimator 7 of 7 * with constraints satisfies the following equation 

= [W,W] T 7_ (20) 


- T 

W z 
0 
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Multiplying (20) by [W, W], we obtain WW T z = 7 , then 

(I - WW T )z = 7 = R3- 


Let Q be an orthogonal complement of Q to matrix with dimensions n x n. The residual sum of 
squares for the model with linear constraints (18) can now be written as 

rss m =||Q 7 ' 'yf + ||Q T (y - x3 M )|| 2 = || y f - ||z|| 2 + ||q T y - r 3 m || 2 

= ||y|| 2 - ||zf + ||WW T z|| 2 = || y || 2 - ||z|| 2 + ||W T z|| 2 

=lly|| 2 - l! z ll 2 + E« z ) 2 > 


where w m is the m-th column of W. 

Denote by (Ao) miP , S mjP , W miP and U mjP submatrices of Ao, S, W and U respectively, obtained 
by retaining first m rows and p columns. Let us consider a nested family of feasible models M m , 
m = 0 ,... ,p — q defined as 

C-M m = {/3 £ K p , (A 0 ) miP /3 = 0} . 

For m = 0,... ,p — q we have 

S p ,m — 

because matrix U m>m is upper triangular. Since W^ m W P;7Jl = I, then W Pj . m U mjm is QR decom¬ 
position of S p>m . Then from equation (21) we get a recursive formula for residual sum of squares 
for nested models: 


RSSmo = l|y || 2 - ll z ll 2 , 

RSSM m = RSSm tti _ 1 + (w^,z ) 2 for m = 1 ,... ,p — 1 . 


Appendix D: Proof of Theorem 1. 

D.l. Properties of orthogonal projection matrices 

For a feasible model M let us define a following orthogonal projection matrix: 

H M = X(X t X)~ 1 A Im (A 0M (X t X)- 1 AJ m )“ 1 A 0 M (X t X)- 1 X t . 

Lemma 1. We have 

H m = — H m - 

Proof. For simplicity of notations in the remainder of this subsection we omit subscript M. Let 
Zi = XA 1 , Z = XA- 1 and Z 0 = XA°. We denote 


G = 


Gn G10 

Gqi Goo 


ZfZi ZfZ 0 
Z^Zi Z^Zo 


= Z 1 Z and G” 1 = 


G 11 G 10 
G 01 G 00 


Note that 


u F = x{x 1 xy 1 x 1 =xa- 1 (a- 1 x 1 xa- 1 )~ 1 a~ 1 x 1 = z(z J zy'z 1 . 
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Moreover 


(A 0 (X t X)~ 1 Aq ) _1 = (AoA- 1 (a~ t x t xa- 1 ) 1 A" t AJ 


-1 


[ 0 I ] (Z T Z)- 1 


0 

I 


-1 


= (G uu ) 


00\-l 


and 


A 0 (X t X)- 1 X t = A 0 A- 1 (Z t Z)- 1 A- t X t = A 0 A- 1 (Z t Z)- 1 Z t . 
Then we get from the Schur complement: 

U F - H m = Z{Z T Z)~ X Z T - Zi(Zf Zi) _1 Zf = ZG X Z T - ZiG^Zf 
= ZG~ 1 Z T - Zi(G n - G 10 (G 00 ) _1 G 10 )Zf 


= Z, 


G 11 

G 10 ' 


[ zf 1 

r v i 

■ G 11 — G 10 (G 00 ) _1 G 10 

0 


G 01 

G 00 _ 



— 1 ^0 J 

0 

0 



= z 


G 

G 


(G 00 )" 1 [ G 


01 


G 


oo 


T 7 \-1 


Z 1 = Z(Z J Z) 


0 

I 


z 

0 

Trrx-lrjT 


(G 00 )- 1 [01] (Z T Z)" 1 Z 


-1 


= X(X J X)-^ Am(X j X) _i A| f ) A„(X J X) _i X J = H 


M- 


□ 


D.2. Asymptotics for residual sums of squares 


Lemmas concerning dependencies between residual sums of squares have similar construction to 
those described in Chen and Chen (2008). Let us introduce some simplifying notations. For two 
sequences of random variables U„ and V n we write that U n <p V n if linin^oo P ( U n < V n ) = 1. 
Residual sum of squares for model M can be decomposed into three parts 

RSSm = ||y - H M y|| 2 =(X/3* + e) T (I - H M )(X/3* + e) 

=/3* T X T (I - H m )X/3* + 2/T t X t (I - Hjf)e + e T (I - H M )e. 


When TCMwe have H M X/3* = X/3* and RSS M = e T (I - H M )e. 
Lemma 2. Assuming p -< n and p -< r n , we have 

RSS t r n 
log pcc <p —• 


Proof. Observe that 


where 


RSS t 

RSS f 


RSSt - RSS f 
+ RSSf 


= 1 + ^ 
n 


e T (H i ,-H T )e n 
£ T (I-H/)£ ' p' 
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Let us notice that Hp — Hy is a matrix of an orthogonal projection with rank p — |T|. Therefore 
W\ = e T (H F - Hp)e ~ & 2 X 2 p -\t\ and W 2 = £ T (I - Hp)e — cr 2 x 2 _ p . Then we get 

E (^) = var (^) = 2 ^ P ~™ 

\ v ) p \ p ) p 

and since p grows monotonically with n we have either p n ^°°> qo, then Var ”-^°°> 0 and 

from Chebyshev’s inequality A 1 rr 2 in probability or p is bounded, then A 1 j s bounded in 

probability. Analogously for W 2 we have 

E ( W A = °~ 2 ( n ~ P) y ar ( W A = 2cr 4 (n — p) 

\ n J n ’ \ n ) n 2 

and since p -< n from Chebyshev’s inequality A 3 - n, ~^°°> cr 2 in probability. 

Therefore E n = Op(l) and jUJt = 1 + Op f ^ J . Hence 


log 



< ~E n 




□ 


Lemma 3. Assuming that p -< A (A is defined in equation (13)) we have for all 5 > 1 


, , (RSS m \ 

mm log - 

mgm v v V sss T J 


>P log 1 + 


Proof. Using the fact that 

and denoting 
where 


1 RSSt = £hVhl)£ = P 

n n 


Sa 2 ■ n 


Op(l) 


RSSm — RSSt = Ajvf + Sm + W T — Wm, 


A m = /3* t X t (I - H m )X/3*, S m = 2/3* T X T (I - H M )e, W T = e T H T £ and W M = e T U M e. 
Note that 

A m > A, S m ~ A/"(0, 4ct 2 A m ), W t ~ v 2 X\t\ and w m ~ <? 2 xl-i- 

Using assumption, and are op(l) from Chebyshev’s inequality. Since the dimension 

of the true model T is finite and independent of n, so is the number of models in A4y and we have 

Sm . Wt 


RSSm — RSSt = A m I 1 + 


A M A M 


\ = + °p(l)) > A^l + op(l)^. 


As a result 


, RSSm , /, , RSS m ~RSS t \ . , (, A \ f ^ 1 

log i!SS^ =l0 H - RSSt - j >f,log (. 1 + fo%J for4>1 - 


□ 
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Lemma 4. Assuming that p ~< A we have 

k fSr ( ‘ 0g RSSM ) <P «S, ( ‘ 0g RSSU )' 


Proof. For 5 > 1 let us denote a = log (l + then from Lemma 3 we get 

^rnin ^ log .RS'S'm) >plogRSST + a > ^max ^log RSSm^ + a 

'* max (log RSSm)- 
mgMt V / 


□ 


D.3. Ordering of squared t- statistics 


In this section we show that ordering of models M £ A4pUA4\> with respect to squared t-statistics 
is equivalent to ordering them with respect to the values of residual sum of squares. 

Let tM , where M £ Mj-UMp denote t-statistic for the full model with one elementary constraint 
Aq m/3 = 0. 


Lemma 5. If p P A, then 


max 

MgMt 


L M 


<p 


min 

MGMv 


t 


2 

M- 


Proof. From Lemma 1 we get that 


RSSm - RSS f = y T (H F - H M )y = f' Aj M (A 0 M (X T X)- 1 Aj M )- 1 A 0 A f3, 


where (3 = (X T X) x X T y. Hence for for each M £ A4p U A4y 

2 (A om 3) 2 _ (Aq m 3) 2 _ (A 0 m3) 2 _ RSSm-RSS f 

M Var(A 0M 3) A 0M Var(3)Aj M a 2 A 0M (X r X)-iAj M a 2 

where a 2 = . Observe that Aqm is 1 x |F| matrix, thus 

2 , , RSSm — RSSp 
t M = (n~ 1 ^ 1 )- wwp -, 

ito o F 

and from Lemma 4 we get the conclusion. □ 


D.f. Correct ordering of constraints using hierarchical clustering 

In this subsection we state conditions under which the true model T belongs to the path of nested 
models obtained in step 4 of DMR algorithm. 

Temporarily let us limit the analysis to a model consisting of one factor and no continuous 
variables. The true partition of set pi} will be denoted by P* = (Cf )(J,. We say that 
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distance matrix D = [d 1:J ] j 7 is consistent with the true partition if dissimilarity measures for elements 
within the same clusters are smaller than for elements from different clusters: 


max max da = d true < d^ alse = min min da. (23) 

ie{i,...,|T|}ijec* h,i 2 e{i,...,\T\}iec*jec? 

h^h 

Let P s i = (a s i)f=r +1 denote a partition of set {l,...,pi} in step s of hierarchical clustering 
algorithm, s = 1,... ,p\. We will name aggregation of Ci aS \ and Cj aS i in step s compatible with 
the true partition P* if there exist l £ {1,..., |T|}, i s+1 £ {1,... ,pi — s} and i s ^ j s , i s ,j s £ 
{1,.. .pi — s + 1} such that 


C is+lS+ u — C isSl U Cj sS i , C is+lS+ll C C*. 

Cutting height in step s is defined as h s \ = d(Ci sS i, Cj sS i) if Ci aS i and Cj gS \ are aggregated in this 
step, hi = (hn ,..., /ipj-ip). 

Lemma 6. Assuming that the linkage criterion of hierarchical clustering algorithm satisfies: 

d (Ci s _|_ lS _|_ifc — Ci s sk C Cj B sk, — C 0 sS fc) 

b min {d ( P^i s sk ? ^do B sk ); d (C j s sk ? (d 0s sk )} (24) 

d - (1 b) max {d ( Ci sS ] i , C QsS k ), d ( Cj s ski C 0 sS fc)} , 


where b £ [0,1] and the dissimilarity matrix has property (23), then the cutting heights for aggrega¬ 
tions compatible with Pf are lower than d true and cutting heights for aggregations not compatible 
with Pf are larger than d^ alse . 


Proof. From (23) if \T\ = pi the statement holds trivially and if \T\ < pi aggregation in the first 
step is compatible with Pf. We assume that in step s aggregation is compatible with the true 
partition with cutting height not greater than d true . If aggregation of Ci s+lS +ip = Ci sS i U Cj sS i 
and Cj. l 5 +i,i = C 0sS i is compatible with P* then 


h s i — d (C'i s+lS +ii, 
If aggregation of C is+is+n 

h s i — d s-j-ii, 


C js+is+ h) < ma x(d{Ci. sl ,C 0 . el ),d(C j . el ,C 0 . 8l )) < d true 
= Ci^siUCjssi and Cj s , lS j r n = C 0aS \ is not compatible with Pf then 
C js+is+ u) > mm(d(C iaSl ,C 0 aSl ),d(C jaSl ,C 0aSl )) > d false 


Hence, cutting heights hn,..., h Pl _\T\,i not greater than d true are used until all aggregations 
compatible with P( are performed. We have C Pi _\t\+ii = P( and in steps s = pi — \T\ + 2 ,... ,pi 
the true partition P* is a subpartition of C s i and cutting heights /i pi _m_|_n,..., h pi —n are not 
less than d false . □ 


Note that linkage criteria: single, complete and average satisfy assumption (24). 


Proof of Theorem la. Let us denote the path of nested models from step 4 of DMR algorithm 
by J = {Mo,..., M p _i}. The event of erroneous selection of the model by DMR algorithm is a 
subset of a sum of three events: 


{f/T}C{TM}U{T£ J, GIC t > min GIC M } 

MCT 

U {T £ J, GICt > min GIC M } 

TCM 

C {T <t J} U {GICt > min GIC Af } U {GICt > min GIC M }- 
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We will show that the probability of each of them tends to zero when n — >• oo. 

Using Lemma 5 let us consider constant /i* such that 

max t 2 M <p h* <p min t 2 M . 

mgm t MeMv 

It is obvious that cutting heights for true constraints for continuous variables are smaller than /i* 
and for false ones greater than h„. It also follows from Lemma 5 that dissimilarity matrices used 
in the algorithm are consistent with the partitions for model T. Then, applying Lemma 6 for each 
factor, we get that the cutting heights for aggregations compatible with the true partitions are not 
greater than h * and for incompatible ones not smaller than /i*. Hence, in DMR algorithm accepting 
true constraints precede accepting false ones, for large n the probability that the true model lies on 
the path of nested models tends to 1. 

Since min^c m RSSm > RSSp we have 

{GIC t > min GIC M } C {log RSS T > log RSS F + — } 
tcm n 

and from Lemma 2 we know that 

P (logRSS T > log RSS f + 0. 


It is obvious that 


{GIC t > min GIC A/ } C {log RSS T > min log RSS M - 

MCT mgm v n 

Let us notice from assumptions of theorem that ~< s<r 2 n+A — (l + s^n)- Then 


\T\r n . RSSm 

- > mm log — wrr- 

n MeM v RSSt 


and from Lemma 3 we know that 


2 log 1 + 


A \ 

— > mm log ——-— 

Sa 2 n J mgMv RSSt 


RSS 


log l 1 + s^) 


. RSSm\ p n 

> , mm log ->• 0. 

M£M v RSSt J 


Hence, DMR algorithm is a consistent model selection method. 

Proof of Theorem lb. Let us denote 

: y/n (p t ~ and b„ = y/n ([3f — (3*^ 


Sn — 


Notice that g n = b n if T = T. From Theorem la 

p(l{T^T) =0 

Since 


1. 


|l(f ^ T) = 0} C {b„l(f ^ T) = 0} , 


□ 
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^ P 

hence b„l(T ^ T) -0. From properties of the OLS estimator we have 

g n l(f = T)-^F(0,a 2 S r ). 

Henceforth, from multidimensional Slutsky’s theorem we get 

b n = b n l(T ^ T) + b n l(f = T) = b n l(f £ T) + g n l (T = T) -L- H(0, ct 2 £ t ). 

□ 
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